Pivotting data

ANU BDSI
workshop
Data Wrangling with R Part 1

Emi Tanaka

Biological Data Science Institute

8th April 2024

Current learning objective

  • -Recognize the characteristics of tidy data
  • -Differentiate between the Base and Tidyverse paradigms
  • -Acquire the skills to add/modify columns, subset data by rows and columns, rename column names, and perform group operations using dplyr
  • Pivot data into longer or wider format using tidyr
  • -Join datasets using dplyr

Tidy data

Definition of a tidy data (Wickham, 2014)

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell
Produced by OmniGraffle 7.18\n2020-11-23 23:57:30 +0000 Canvas 2 Layer 1 variables observations values
  • dplyr, tidyr and ggplot2 are downstream packages to work with tidy data

Wickham (2014) Tidy Data. Journal of Statistical Software

Evolving language

  • Earlier efforts to transform data from wide to long was in the reshape, first released on CRAN in 2005-08-05
  • It was then superseded by reshape2 released on CRAN in 2010-09-10
  • Then finally tidyr released on CRAN in 2014-07-21 *v1.0.0 released 2019-09-12

Wide to long

  • reshape::melt
  • reshape2::melt
  • tidyr::gather
  • tidyr::pivot_longer*

Long to wide

  • reshape::cast
  • reshape2::dcast
  • tidyr::spread
  • tidyr::pivot_wider*

Hadley Wickham (2020). tidyr: Tidy Messy Data. R package version 1.1.2.

Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20

Lifecycle

Canvas 1 Layer 1 experimental stable superseded deprecated

lifecycle: archived lifecycle archived lifecycle: defunct lifecycle defunct lifecycle: deprecated lifecycle deprecated lifecycle: experimental lifecycle experimental lifecycle: maturing lifecycle maturing lifecycle: questioning lifecycle questioning lifecycle: retired lifecycle retired lifecycle: soft-deprecated lifecycle soft-deprecated lifecycle: stable lifecycle stable lifecycle: superseded lifecycle superseded

  • Functions (and sometimes arguments of functions) in tidyverse packages often are labelled with a badge like on the left
  • Find the definitions of badges here
  • Check out documentations below
help(mutate, package = "dplyr")
help(mutate_each, package = "dplyr")

Lionel Henry (2020). lifecycle: Manage the Life Cycle of your Package Functions. R package version 0.2.0.

Pivotting data with tidyr Part 1

Data
df_wide <- tibble::tibble(state = c("NSW", "VIC", "ACT"),
                          `2019` = c(8130159, 6655284, 427892),
                          `2018` = c(80366651, 6528601, 423169),
                          `2017` = c(7919815, 6387081, 415874))
df_long <- pivot_longer(df_wide, cols = `2019`:`2017`, 
                        names_to = "year",
                        values_to = "population")
df_wide
state 2019 2018 2017
NSW 8130159 80366651 7919815
VIC 6655284 6528601 6387081
ACT 427892 423169 415874
pivot_longer(df_wide, 
             cols = `2019`:`2017`, 
             names_to = "year",
             values_to = "population")
pivot_wider(df_long, 
            id_cols = state,
            names_from = year, 
            values_from = population)
df_long
state year population
NSW 2019 8130159
NSW 2018 80366651
NSW 2017 7919815
VIC 2019 6655284
VIC 2018 6528601
VIC 2017 6387081
ACT 2019 427892
ACT 2018 423169
ACT 2017 415874

Values adapted from Australian Bureau of Statistics. (2020). Table 04. Estimated Resident Population, States and Territories [Time series spreadsheet]. National, state and territory population, Australia Mar 2020. Retrieved Nov 24, 2020. https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2020/310104.xls

Pivotting data with tidyr Part 2

Data
yield_long <- data.frame(year = c(1900, 1900, 2000, 1900, 1900, 2000, 2000),
                         state = c("Iowa", "Kansas", "Kansas", "Iowa", "Kansas", "Iowa", "Kansas"),
                         crop = c("barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat"),
                         yield = c(28.5, 18, 35, 14.4, 18.2, 47, 37))
yield_wide <- pivot_wider(yield_long, names_from = crop, values_from = yield, names_glue = "{crop}_yield")
yield_long
year state crop yield
1900 Iowa barley 28.5
1900 Kansas barley 18.0
2000 Kansas barley 35.0
1900 Iowa wheat 14.4
1900 Kansas wheat 18.2
2000 Iowa wheat 47.0
2000 Kansas wheat 37.0
yield_wide
year state barley_yield wheat_yield
1900 Iowa 28.5 14.4
1900 Kansas 18.0 18.2
2000 Kansas 35.0 37.0
2000 Iowa NA 47.0

yield_long yield_wide

pivot_wider(yield_long, 
            id_cols = c(year, state),
            names_from = crop, 
            values_from = yield, 
            names_glue = "{crop}_yield") #<<

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 3

Data
yield_long <- data.frame(year = c(1900, 1900, 2000, 1900, 1900, 2000, 2000),
                         state = c("Iowa", "Kansas", "Kansas", "Iowa", "Kansas", "Iowa", "Kansas"),
                         crop = c("barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat"),
                         yield = c(28.5, 18, 35, 14.4, 18.2, 47, 37))
yield_wide <- pivot_wider(yield_long, names_from = crop, values_from = yield, names_glue = "{crop}_yield")
yield_long
year state crop yield
1900 Iowa barley 28.5
1900 Kansas barley 18.0
2000 Kansas barley 35.0
1900 Iowa wheat 14.4
1900 Kansas wheat 18.2
2000 Iowa wheat 47.0
2000 Kansas wheat 37.0
yield_wide
year state barley_yield wheat_yield
1900 Iowa 28.5 14.4
1900 Kansas 18.0 18.2
2000 Kansas 35.0 37.0
2000 Iowa NA 47.0

yield_wide yield_long

pivot_longer(yield_wide, 
        cols = contains("yield"), 
        names_to = "crop", 
        names_pattern = "(.+)_yield",  
        values_to = "yield", 
        values_drop_na = TRUE)

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 4

crop_long
year state crop metric value
1900 Iowa barley yield 28.5
1900 Iowa barley acres 620,000.0
1900 Kansas barley yield 18.0
1900 Kansas barley acres 127,000.0
2000 Kansas barley yield 35.0
2000 Kansas barley acres 7,000.0
1900 Iowa wheat yield 14.4
1900 Iowa wheat acres 1,450,000.0
1900 Kansas wheat yield 18.2
1900 Kansas wheat acres 4,290,000.0
2000 Iowa wheat yield 47.0
2000 Iowa wheat acres 18,000.0
2000 Kansas wheat yield 37.0
2000 Kansas wheat acres 9,400,000.0
crop_wide
year state barley_yield wheat_yield barley_acres wheat_acres
1900 Iowa 28.5 14.4 620000 1450000
1900 Kansas 18.0 18.2 127000 4290000
2000 Kansas 35.0 37.0 7000 9400000
2000 Iowa NA 47.0 NA 18000

crop_long crop_wide

pivot_wider(crop_long, 
    names_from = c(crop, metric), 
    values_from = value, 
    names_glue = "{crop}_{metric}") 

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Pivotting data with tidyr Part 5

crop_long
year state crop metric value
1900 Iowa barley yield 28.5
1900 Iowa barley acres 620,000.0
1900 Kansas barley yield 18.0
1900 Kansas barley acres 127,000.0
2000 Kansas barley yield 35.0
2000 Kansas barley acres 7,000.0
1900 Iowa wheat yield 14.4
1900 Iowa wheat acres 1,450,000.0
1900 Kansas wheat yield 18.2
1900 Kansas wheat acres 4,290,000.0
2000 Iowa wheat yield 47.0
2000 Iowa wheat acres 18,000.0
2000 Kansas wheat yield 37.0
2000 Kansas wheat acres 9,400,000.0
crop_wide
year state barley_yield wheat_yield barley_acres wheat_acres
1900 Iowa 28.5 14.4 620000 1450000
1900 Kansas 18.0 18.2 127000 4290000
2000 Kansas 35.0 37.0 7000 9400000
2000 Iowa NA 47.0 NA 18000

crop_wide crop_long

pivot_longer(crop_wide, 
      cols = -c(year, state),
      names_to = c("crop", "metric"), 
      names_pattern = "(.+)_(.+)",
      values_to = "value")

United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/

Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17

Separate values into columns

Data
pkg_dat <- tribble(~package, ~maintainer,
        "dplyr", "Hadley Wickham",
        "magrittr", "Lionel Henry",
        "tidyr", "Hadley Wickham",
        "stringr", "Hadley Wickham",
        "rlang", "Lionel Henry",
        "tibble", "Kirill Müller",
        "tidyselect", "Lionel Henry") %>% 
  arrange(package)
pkg_dat
package maintainer
dplyr Hadley Wickham
magrittr Lionel Henry
rlang Lionel Henry
stringr Hadley Wickham
tibble Kirill Müller
tidyr Hadley Wickham
tidyselect Lionel Henry

🎯 separate maintainer name to columns, first name and last name

pkg_dat %>% 
  separate(maintainer, 
           into = c("first_name", "last_name"),
           sep = " ")
# A tibble: 7 × 3
  package    first_name last_name
  <chr>      <chr>      <chr>    
1 dplyr      Hadley     Wickham  
2 magrittr   Lionel     Henry    
3 rlang      Lionel     Henry    
4 stringr    Hadley     Wickham  
5 tibble     Kirill     Müller   
6 tidyr      Hadley     Wickham  
7 tidyselect Lionel     Henry    

Separate values into rows

Data
author_dat <- tribble(~package, ~author,
        "dplyr", "Hadley Wickham, Romain François, Lionel Henry, Kirill Müller",
        "magrittr", "Lionel Henry, Stefan Milton Bache, Hadley Wickham",
        "tidyr", "Hadley Wickham",
        "stringr", "Hadley Wickham",
        "rlang", "Lionel Henry, Hadley Wickham",
        "tibble", "Kirill Müller, Hadley Wickham",
        "tidyselect", "Lionel Henry, Hadley Wickham") %>% 
  arrange(package)
author_dat
package author
dplyr Hadley Wickham, Romain François, Lionel Henry, Kirill Müller
magrittr Lionel Henry, Stefan Milton Bache, Hadley Wickham
rlang Lionel Henry, Hadley Wickham
stringr Hadley Wickham
tibble Kirill Müller, Hadley Wickham
tidyr Hadley Wickham
tidyselect Lionel Henry, Hadley Wickham
author_dat %>% 
  separate_rows(author, sep = ", ")
# A tibble: 15 × 2
   package    author             
   <chr>      <chr>              
 1 dplyr      Hadley Wickham     
 2 dplyr      Romain François    
 3 dplyr      Lionel Henry       
 4 dplyr      Kirill Müller      
 5 magrittr   Lionel Henry       
 6 magrittr   Stefan Milton Bache
 7 magrittr   Hadley Wickham     
 8 rlang      Lionel Henry       
 9 rlang      Hadley Wickham     
10 stringr    Hadley Wickham     
11 tibble     Kirill Müller      
12 tibble     Hadley Wickham     
13 tidyr      Hadley Wickham     
14 tidyselect Lionel Henry       
15 tidyselect Hadley Wickham     

Summary

Exercise time

10:00